Podcast Reviews Data Analysis
The dataset for this project is iTunes Podcast Reviews, sourced from scraped iTunes podcast review RSS feeds. It contains reviews spanning 2019 to 2023 from the USA, offering valuable time-series data for a single market.
Business Stakeholders and Objectives
Stakeholders:
- Podcast creators / Podcast sponsors: Interested in understanding audience preferences and improving their content based on feedback.
- Marketing teams: Aim to identify effective marketing strategies and optimize promotional efforts.
- Data analysts: Responsible for extracting insights from the data to inform decision-making.
Objectives:
- Podcast creators: To identify popular podcast genres/topics and areas for content improvement through analysis of listener engagement and feedback.
- Marketing teams: To analyze trends in podcast listenership and sentiment to inform marketing campaigns and target audience outreach.
- Data analysts: To conduct thorough exploratory analysis to extract actionable insights from the podcast reviews dataset.
import os
from dotenv import load_dotenv

load_dotenv()

# GPU availability check
from numba import cuda

cuda.detect()

# Database access
from sqlalchemy import create_engine
import sqlite3
from math import sqrt

# cudf.pandas must be loaded before pandas is imported
%load_ext cudf.pandas
import pandas as pd
import numpy as np
import cudf
from concurrent.futures import ThreadPoolExecutor

# NLP / sentiment analysis
from textblob import TextBlob
from collections import Counter
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

nltk.download("vader_lexicon")
nltk.download("punkt")

# Statistics
import scipy.stats as stats
from scipy.stats import (
    f_oneway,
    chi2_contingency,
    norm,
    t,
    ttest_ind,
    kstest,
    levene,
    shapiro,
    kruskal,
)
from scikit_posthocs import posthoc_dunn
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.proportion import proportion_confint

# Plotting
import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp

pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()
Found 1 CUDA devices
id 0 b'NVIDIA GeForce RTX 2060' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 1
UUID: GPU-d5fffe5d-eda6-e044-a96d-7e29d7648f51
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
Summary:
1/1 devices are supported
[nltk_data] Downloading package vader_lexicon to /home/cannelle/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/cannelle/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
from utils.functions import (
    print_missing_and_duplicates,
    map_to_general_category,
    analyze_sentiment,
    sample_data_with_min_category_count,
    check_independence,
    check_sample_sizes,
)
# kaggle_json_filename = "kaggle.json"
# notebook_directory = os.getcwd()
# kaggle_json_path = os.path.join(notebook_directory, kaggle_json_filename)
# if os.path.exists(kaggle_json_path):
#     os.environ['KAGGLE_CONFIG_DIR'] = notebook_directory
#     import kaggle
# else:
#     print("Error: kaggle.json file not found in the project root directory.")
# kaggle.api.authenticate()
# kaggle.api.dataset_download_files(dataset="thoughtvector/podcastreviews", path="./datasets", unzip=True)
# download_path = "./datasets"
# old_file_path = os.path.join(download_path, "database.db")
# new_file_path = os.path.join(download_path, "database.sqlite")
# if os.path.exists(old_file_path):
#     os.rename(old_file_path, new_file_path)
cnx = sqlite3.connect("./datasets/database.sqlite")
df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)
print(df)
         name
0        runs
1    podcasts
2  categories
3     reviews
categories = pd.read_sql_query("SELECT * FROM Categories", cnx)
podcasts = pd.read_sql_query("SELECT * FROM Podcasts", cnx)
reviews = pd.read_sql_query("SELECT * FROM Reviews", cnx)
runs = pd.read_sql_query("SELECT * FROM Runs", cnx)
display(categories.head(2))
display(podcasts.head(2))
display(reviews.head(2))
display(runs.head(2))
| | podcast_id | category |
|---|---|---|
| 0 | c61aa81c9b929a66f0c1db6cbe5d8548 | arts |
| 1 | c61aa81c9b929a66f0c1db6cbe5d8548 | arts-performing-arts |
| | podcast_id | itunes_id | slug | itunes_url | title |
|---|---|---|---|---|---|
| 0 | a00018b54eb342567c94dacfb2a3e504 | 1313466221 | scaling-global | https://podcasts.apple.com/us/podcast/scaling-... | Scaling Global |
| 1 | a00043d34e734b09246d17dc5d56f63c | 158973461 | cornerstone-baptist-church-of-orlando | https://podcasts.apple.com/us/podcast/cornerst... | Cornerstone Baptist Church of Orlando |
| | podcast_id | title | content | rating | author_id | created_at |
|---|---|---|---|---|---|---|
| 0 | c61aa81c9b929a66f0c1db6cbe5d8548 | really interesting! | Thanks for providing these insights. Really e... | 5 | F7E5A318989779D | 2018-04-24T12:05:16-07:00 |
| 1 | c61aa81c9b929a66f0c1db6cbe5d8548 | Must listen for anyone interested in the arts!!! | Super excited to see this podcast grow. So man... | 5 | F6BF5472689BD12 | 2018-05-09T18:14:32-07:00 |
| | run_at | max_rowid | reviews_added |
|---|---|---|---|
| 0 | 2021-05-10 02:53:00 | 3266481 | 1215223 |
| 1 | 2021-06-06 21:34:36 | 3300773 | 13139 |
There are 4 tables:
- Categories - Categories data [`podcast_id`, `category`]
- Podcasts - Podcasts data [`podcast_id`, `itunes_id`, `slug`, `itunes_url`, `title`]
- Reviews - Reviews data [`podcast_id`, `title`, `content`, `rating`, `author_id`, `created_at`]
- Runs - Runs data [`run_at`, `max_rowid`, `reviews_added`]
- Missing values can lead to inaccurate or misleading statistics and machine learning predictions. They arise from data entry errors, failure to collect information, and similar causes; depending on their nature and extent, different handling strategies can be employed.
- Duplicate values can arise from data entry errors or the merging of datasets, and they can bias the results of an analysis, so it is important to identify and remove them.
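The `print_missing_and_duplicates` helper is imported from `utils.functions` and its source is not shown in this notebook. A minimal sketch of what such a helper might look like, based on the report format seen in the output (the exact signature and wording are assumptions):

```python
import pandas as pd

def print_missing_and_duplicates(df: pd.DataFrame, table_name: str) -> None:
    """Report the number of missing values and fully duplicated rows in a table."""
    missing = int(df.isna().sum().sum())
    duplicates = int(df.duplicated().sum())
    if missing:
        print(f"Missing values in {table_name} table: {missing}")
    if duplicates:
        print(f"Duplicates in {table_name} table: {duplicates}")

# toy check: one duplicated row and one missing value
demo = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", None]})
print_missing_and_duplicates(demo, "Demo")
```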
print_missing_and_duplicates(categories, "Categories")
print_missing_and_duplicates(podcasts, "Podcasts")
print_missing_and_duplicates(reviews, "Reviews")
print_missing_and_duplicates(runs, "Runs")
Duplicates in Reviews table: 655
Drop Duplicate Rows:
reviews = reviews.drop_duplicates()
Chosen Strategy for Organizing Tables:
1. Merging Tables:
- The three tables are merged on the `podcast_id` value.
- The rows are sorted by this value.
- Merging the tables was chosen to streamline the workflow and facilitate comparisons across the data.
merged_table = pd.merge(categories, podcasts, on="podcast_id", how="outer")
merged_table = pd.merge(merged_table, reviews, on="podcast_id", how="outer")
merged_table.sort_values(by=["podcast_id"], inplace=True)
merged_table = merged_table.reset_index(drop=True)
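Because each podcast can have several category rows and many reviews, the outer merges above are one-to-many and multiply rows rather than preserving table length. A toy sketch of this behavior (hypothetical `p1`/`p2` data, not from the dataset):

```python
import pandas as pd

# One podcast (p1) with two categories and two reviews; p2 has a review
# but no category row.
cats = pd.DataFrame({"podcast_id": ["p1", "p1"], "category": ["arts", "comedy"]})
revs = pd.DataFrame({"podcast_id": ["p1", "p1", "p2"], "rating": [5, 4, 3]})

# how="outer" keeps unmatched keys (p2 gets a NaN category) and expands
# one-to-many matches: 2 categories x 2 reviews for p1 -> 4 rows.
merged = pd.merge(cats, revs, on="podcast_id", how="outer")
print(len(merged))  # 5
```

This is why the merged table later shows far more rows (one per category-review pair) than there are podcasts.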
2. Preparing the data for display:
- Mapping the category values to fit into one of the general categories: `Business & Finance`, `Religion & Spirituality`, `News & Politics`, `Sports & Recreation`, `Arts`, `Education`, `Society & Culture`, `TV & Film`, `Health & Fitness`, `Others`, `Music`, `True Crime`, `Comedy`, `History`, `Leisure`, `Kids & Family`, `Science`, `Fiction`, `Technology`, `Government`
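The `map_to_general_category` helper is imported from `utils.functions`; a plausible sketch of how raw slugs such as `arts-performing-arts` could be collapsed into the general buckets above (the bucket map and matching rule are assumptions, not the actual implementation):

```python
# Assumed bucket map: leading slug token -> general category name.
GENERAL_BUCKETS = {
    "arts": "Arts",
    "business": "Business & Finance",
    "comedy": "Comedy",
    "education": "Education",
    "fiction": "Fiction",
    "government": "Government",
    "health": "Health & Fitness",
    "history": "History",
    "kids": "Kids & Family",
    "leisure": "Leisure",
    "music": "Music",
    "news": "News & Politics",
    "religion": "Religion & Spirituality",
    "science": "Science",
    "society": "Society & Culture",
    "sports": "Sports & Recreation",
    "technology": "Technology",
    "true-crime": "True Crime",
    "tv": "TV & Film",
}

def map_to_general_category(raw: str) -> str:
    """Collapse a raw category slug (e.g. 'arts-performing-arts') into a general bucket."""
    slug = str(raw).strip().lower()
    if slug in GENERAL_BUCKETS:       # exact match first (e.g. 'true-crime')
        return GENERAL_BUCKETS[slug]
    head = slug.split("-")[0]         # otherwise match on the leading token
    return GENERAL_BUCKETS.get(head, "Others")

print(map_to_general_category("arts-performing-arts"))  # Arts
```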
processed_dataset = merged_table.copy()
processed_dataset["category"] = processed_dataset["category"].apply(
map_to_general_category
)
processed_dataset["podcast_title"] = (
processed_dataset["title_x"].fillna("")
+ " "
+ processed_dataset["title_y"].fillna("")
)
processed_dataset.head(2)
| | podcast_id | category | itunes_id | slug | itunes_url | title_x | title_y | content | rating | author_id | created_at | podcast_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a00018b54eb342567c94dacfb2a3e504 | Business & Finance | 1.313466e+09 | scaling-global | https://podcasts.apple.com/us/podcast/scaling-... | Scaling Global | Very informative | Great variety of speakers! | 5 | CC47C85896D423B | 2017-11-29T12:16:43-07:00 | Scaling Global Very informative |
| 1 | a00043d34e734b09246d17dc5d56f63c | Religion & Spirituality | 1.589735e+08 | cornerstone-baptist-church-of-orlando | https://podcasts.apple.com/us/podcast/cornerst... | Cornerstone Baptist Church of Orlando | Good Sernons | I'm a regular listener. I only wish that the ... | 5 | 103CC9DA2046218 | 2019-10-08T04:23:32-07:00 | Cornerstone Baptist Church of Orlando Good Ser... |
OUTCOMES:
- No missing values were found in the datasets.
- 655 duplicates were found in the Reviews table and were subsequently dropped.
- The decision was made to reorganize the data into one table, as this could potentially facilitate analysis in further steps.
Preliminary Plan for Data Exploration
Basic Exploration
- Utilize the `describe` function to provide an overview of numerical and categorical features in each dataset.
- Check the distributions of podcasts over categories, ratings, and the number of reviews over time.
Detailed Exploration
Data Sampling:
- Preparing a subset of data
Trends Characteristics Analysis:
- Understanding podcast listenership trends
- Identifying popular podcast genres/topics
- Analyzing sentiment of podcast reviews
Statistical Inference:
- Correlation between average ratings and voting counts
- Variances in rating averages across podcast categories
- Monthly variations in rating averages
print("Processed Dataset")
display(processed_dataset.describe(include=["object"]).T)
Processed Dataset
| | count | unique | top | freq |
|---|---|---|---|---|
| podcast_id | 4552196 | 111544 | bf5bf76d5b6ffbf9a31bba4480383b7f | 33100 |
| category | 4552196 | 20 | Society & Culture | 661552 |
| slug | 4527973 | 108919 | crime-junkie | 33100 |
| itunes_url | 4527973 | 110024 | https://podcasts.apple.com/us/podcast/crime-ju... | 33100 |
| title_x | 4527973 | 109274 | Crime Junkie | 33100 |
| title_y | 4552196 | 1138688 | Great podcast | 30828 |
| content | 4552196 | 2049707 | I love this podcast! | 404 |
| author_id | 4552196 | 1475285 | D3307ADEFFA285C | 1660 |
| created_at | 4552196 | 2054352 | 2017-09-19T08:29:49-07:00 | 14 |
| podcast_title | 4552196 | 1868684 | Crime Junkie Obsessed | 466 |
unique_ratings = processed_dataset["rating"].unique()
count_ratings = len(unique_ratings)
top_rating = processed_dataset["rating"].mode().values[0]
top_rating_freq = processed_dataset["rating"].value_counts().max()
total_ratings = processed_dataset["rating"].count()
print("Unique Ratings:", unique_ratings)
print("Count of Unique Ratings:", count_ratings)
print("Most Common Rating (Top):", top_rating)
print("Frequency of Most Common Rating:", top_rating_freq)
print("Total Number of Ratings:", total_ratings)
Unique Ratings: [5 1 4 2 3]
Count of Unique Ratings: 5
Most Common Rating (Top): 5
Frequency of Most Common Rating: 3982850
Total Number of Ratings: 4552196
category_counts = processed_dataset["category"].value_counts()
fig = px.bar(
x=category_counts.index,
y=category_counts.values,
labels={"x": "Category", "y": "Count"},
)
fig.update_layout(
title="Podcast Counts by Category",
xaxis_title="Category",
yaxis_title="Podcast Count",
template="plotly_dark",
)
fig.show()
rating_counts = processed_dataset["rating"].value_counts()
fig = px.bar(
x=rating_counts.index, y=rating_counts.values, labels={"x": "Rating", "y": "Count"}
)
fig.update_layout(
title="Podcast Counts by Rating",
xaxis_title="Rating",
yaxis_title="Podcast Count",
template="plotly_dark",
)
fig.show()
runs["run_date"] = pd.to_datetime(runs["run_at"]).dt.date
reviews_added_per_day = runs.groupby("run_date")["reviews_added"].sum().reset_index()
fig = px.line(
reviews_added_per_day,
x="run_date",
y="reviews_added",
title="Reviews Added Over Time",
template="plotly_dark",
)
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Number of Reviews Added")
fig.show()
OUTCOMES:
- Podcast Counts by Category: The Society & Culture category has the highest number of podcasts (661,552), followed by Business & Finance (435,586) and Comedy (413,024). Conversely, the categories with the fewest podcasts are Government (15,483), Others (25,906), and Technology (47,808).
- Podcast Counts by Rating: The majority of podcasts received a rating of 5 (3.98 million), indicating the highest level of satisfaction, while the fewest were rated 2 (94,620) on a scale of 1-5.
- Reviews Added Over Time: The highest number of reviews was recorded on May 10, 2021 (exceeding 1.2 million), followed by July 3, 2022 (559,523).
The sampling method used in this exploratory data analysis (EDA) employs a random sample selection of 10% of the dataset's total rows to strike a balance between representativeness and computational efficiency. This percentage was chosen to ensure adequate coverage of dataset characteristics while minimizing computational resources required. Setting the random state parameter to 42 ensures reproducibility of results across analyses.
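The `sample_data_with_min_category_count` helper is imported from `utils.functions` and its body is not shown. A plausible sketch, assuming it draws the 10% random sample with `random_state=42` described above and then tops up any category that falls below a minimum row count (the `min_count` parameter and top-up logic are assumptions suggested by the function's name):

```python
import pandas as pd

def sample_data_with_min_category_count(
    df: pd.DataFrame,
    frac: float = 0.1,
    min_count: int = 5,       # assumed floor; the real value is unknown
    random_state: int = 42,
) -> pd.DataFrame:
    """Draw a frac-sized random sample, then top up any category that
    ended up with fewer than min_count rows."""
    sample = df.sample(frac=frac, random_state=random_state)
    for cat in df["category"].unique():
        have = int((sample["category"] == cat).sum())
        need = min_count - have
        if need > 0:
            # rows of this category not already in the sample
            pool = df[df["category"] == cat].drop(index=sample.index, errors="ignore")
            top_up = pool.sample(n=min(need, len(pool)), random_state=random_state)
            sample = pd.concat([sample, top_up])
    return sample

# toy check: 90 rows of category A, 10 of B
toy = pd.DataFrame({"category": ["A"] * 90 + ["B"] * 10, "rating": [5] * 100})
sampled = sample_data_with_min_category_count(toy)
```

The top-up step matters for the per-category inference later: a plain 10% sample could leave small categories (e.g. Government) with too few rows for stable confidence intervals.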
sampled_data = sample_data_with_min_category_count(processed_dataset)
display(sampled_data.head(2))
| | podcast_id | category | itunes_id | slug | itunes_url | title_x | title_y | content | rating | author_id | created_at | podcast_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4257723 | f85fdf4372bbd041d91178b41cec9c62 | News & Politics | 6.667518e+08 | james-obriens-mystery-hour | https://podcasts.apple.com/us/podcast/james-ob... | James O'Brien's Mystery Hour | How can I possibly be only 2nd review for this... | I live in Kansas City and am not able to liste... | 5 | 9EEF534801232E5 | 2016-05-19 12:28:52-07:00 | James O'Brien's Mystery Hour How can I possibl... |
| 460510 | aa41a90ae2ebfae71eb887fe9375b6d5 | News & Politics | 1.458648e+09 | hear-the-bern | https://podcasts.apple.com/us/podcast/hear-the... | Hear the Bern | I’m a big Bernie supporter but was still surpr... | So you would assume a podcast about a presiden... | 5 | 88BD551F001F391 | 2019-09-08 18:46:28-07:00 | Hear the Bern I’m a big Bernie supporter but w... |
most_rated_query_df = (
sampled_data.groupby(["podcast_title"])
.agg({"rating": ["count", "mean"]})
.reset_index()
)
most_rated_query_df.columns = ["podcast_title", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
by="rating_count", ascending=False
).head(10)
most_rated_query_df_count = most_rated_query_df.sort_values(
by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
by="avg_rating", ascending=False
)
fig1 = px.bar(
most_rated_query_df_count,
x="podcast_title",
y="rating_count",
title="Top 10 Podcasts by Review Frequency",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Title")
fig1.update_yaxes(title="Number of Reviews Received")
fig2 = px.bar(
best_rated_query_df_avg,
x="podcast_title",
y="avg_rating",
title="Top 10 Podcasts by Average Rating",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Title")
fig2.update_yaxes(title="Average Rating")
fig1.show()
fig2.show()
most_rated_query_df = (
sampled_data.groupby(["category"]).agg({"rating": ["count", "mean"]}).reset_index()
)
most_rated_query_df.columns = ["category", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
by="rating_count", ascending=False
).head(10)
most_rated_query_df_count = most_rated_query_df.sort_values(
by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
by="avg_rating", ascending=False
)
fig1 = px.bar(
most_rated_query_df_count,
x="category",
y="rating_count",
title="Top 10 Categories by Review Frequency",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Category")
fig1.update_yaxes(title="Number of Reviews Received")
fig2 = px.bar(
best_rated_query_df_avg,
x="category",
y="avg_rating",
title="Top 10 Categories by Average Rating",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Category")
fig2.update_yaxes(title="Average Rating")
fig1.show()
fig2.show()
sampled_data["sentiment"] = sampled_data["content"].apply(analyze_sentiment)
fig = px.histogram(
sampled_data,
x="sentiment",
nbins=30,
title="Sentiment Distribution of Podcast Reviews",
)
fig.update_layout(
xaxis_title="Sentiment Polarity",
yaxis_title="Frequency",
bargap=0.05,
template="plotly_dark",
)
fig.show()
OUTCOMES:
- The most rated podcasts were "Crime Junkie Obsessed" with 48 ratings, "Crime Junkie Amazing" with 33 ratings, and "Crime Junkie Love!" with 29 ratings. These podcasts had respective average ratings of 5.00, 5.00, and 4.93.
- The podcasts with the highest average ratings were "Crime Junkie Obsessed", "Crime Junkie Amazing", "Crime Junkie Amazing!", "Awesome", and "Crime Junkie Addicted", all achieving a perfect average rating of 5.0.
- The most rated podcast category was "Society & Culture," which amassed over 32 thousand ratings out of more than 4.5 million total ratings, accounting for approximately 0.7 % of the total. Other highly rated categories include "Sports & Recreation", with over 17 thousand ratings (0.3% of the total), and "TV & Film", with more than 15 thousand ratings (0.3%).
- Podcasts in the Business & Finance category achieved the highest average ratings, with an impressive average of 4.87. Following closely are podcasts in the Religion & Spirituality category and Education category, both with an average rating of 4.83.
- The sentiment analysis indicates a predominantly positive sentiment in podcast reviews, with the majority of ratings falling within the positive range.
INSIGHTS:
- The most rated podcasts have accumulated more than 40 ratings each, with highly favorable average ratings ranging between 5.00 and 4.93.
- Podcasts consistently rated with an average of 5.0 not only attract a substantial audience but also consistently deliver content that resonates exceptionally well, resulting in consistently high average ratings across episodes.
- While "Society & Culture" may have the most rated podcasts, along with "Sports & Recreation", and "TV & Film" attracting a significant number of ratings, it's noteworthy that podcasts in the Business & Finance category receive the highest average ratings. This suggests that listeners highly appreciate the quality and value of content offered in this genre. Similarly, Religion & Spirituality and Education categories also boast exceptionally high average ratings, indicating strong listener satisfaction in these areas.
- The distribution of sentiment analysis highlights a predominantly positive sentiment in podcast reviews, indicating high overall satisfaction among listeners.
The goal of statistical inference for the podcast review EDA is to study the impact of podcast parameters on the distribution of their ratings. We examine differences in ratings between podcast categories, differences in the number of people voting for them, and whether specific time periods are associated with higher or lower ratings.

Category Rating Differences
This analysis focuses on examining whether there are rating differences between categories in the database.
Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.
Significance Levels:
The chosen significance level for hypothesis testing α = 0.05.
Confidence Intervals:
The confidence intervals for the mean ratings within podcast categories are as follows:
Arts: [4.686, 4.733]
Business & Finance: [4.860, 4.881]
Comedy: [4.664, 4.696]
Education: [4.815, 4.843]
Fiction: [4.569, 4.649]
Government: [4.496, 4.677]
Health & Fitness: [4.781, 4.807]
History: [4.667, 4.745]
Kids & Family: [4.698, 4.740]
Leisure: [4.750, 4.784]
Music: [4.758, 4.802]
News & Politics: [4.234, 4.281]
Others: [4.631, 4.708]
Religion & Spirituality: [4.822, 4.848]
Science: [4.555, 4.627]
Society & Culture: [4.641, 4.663]
Sports & Recreation: [4.625, 4.656]
TV & Film: [4.531, 4.566]
Technology: [4.556, 4.631]
True Crime: [4.147, 4.195]
Each interval provides a range of plausible values for the true population parameter, the mean rating for podcasts within the corresponding category, with a specified level of confidence of 95%. For example, in the 'Arts' category, we can be 95% confident that the true mean rating falls within the interval [4.686, 4.733]. These confidence intervals help to assess the variability in ratings across different podcast categories.
mean_ratings = sampled_data.groupby("category")["rating"].mean()
sample_size = sampled_data.groupby("category")["rating"].count()
sample_std = sampled_data.groupby("category")["rating"].std()
confidence_level = 0.95
# Two-sided t critical value per category (df = n - 1)
t_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)
# Degenerate groups (zero variance or fewer than 2 observations) get a
# zero margin of error instead of propagating NaN.
zero_std_mask = sample_std == 0
sample_std[zero_std_mask] = np.nan
small_sample_mask = sample_size < 2
t_value[small_sample_mask] = 0
sample_std[small_sample_mask] = np.nan
sample_std_filled = sample_std.fillna(0)
margin_of_error = t_value * sample_std_filled / np.sqrt(sample_size)
confidence_interval_means = (
    mean_ratings - margin_of_error,
    mean_ratings + margin_of_error,
)
print("Confidence Interval for Mean Rating within Categories:")
display(confidence_interval_means)
lower_bounds = confidence_interval_means[0]
upper_bounds = confidence_interval_means[1]
overall_lower_bound = np.min(lower_bounds)
overall_upper_bound = np.max(upper_bounds)
print("Overall Lower Bound:", overall_lower_bound)
print("Overall Upper Bound:", overall_upper_bound)
Confidence Interval for Mean Rating within Categories:
| category | lower bound | upper bound |
|---|---|---|
| Arts | 4.686810 | 4.732914 |
| Business & Finance | 4.860490 | 4.881052 |
| Comedy | 4.663586 | 4.696280 |
| Education | 4.814655 | 4.842816 |
| Fiction | 4.568826 | 4.648811 |
| Government | 4.496118 | 4.676677 |
| Health & Fitness | 4.780999 | 4.807036 |
| History | 4.666550 | 4.745215 |
| Kids & Family | 4.697983 | 4.740142 |
| Leisure | 4.749629 | 4.784429 |
| Music | 4.758497 | 4.802010 |
| News & Politics | 4.233139 | 4.280996 |
| Others | 4.631171 | 4.707833 |
| Religion & Spirituality | 4.821502 | 4.847650 |
| Science | 4.554748 | 4.627287 |
| Society & Culture | 4.641353 | 4.663252 |
| Sports & Recreation | 4.624880 | 4.655529 |
| TV & Film | 4.530586 | 4.566012 |
| Technology | 4.556431 | 4.630872 |
| True Crime | 4.147117 | 4.195473 |

Overall Lower Bound: 4.147117490111806
Overall Upper Bound: 4.88105228429321
Data Distribution Check:
grouped_ratings = sampled_data.groupby("category")["rating"]
for category, ratings in grouped_ratings:
print(f"Category: {category}")
stat, p_value = shapiro(ratings)
print("Shapiro-Wilk Test p-value:", p_value)
stat, p_value = kstest(ratings, "norm")
print("Kolmogorov-Smirnov Test p-value:", p_value)
print("\n")
/tmp/ipykernel_7695/2213632488.py:6: UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. (warning repeated for every category with N > 5000)

| Category | Shapiro-Wilk p-value | Kolmogorov-Smirnov p-value |
|---|---|---|
| Arts | 4.435199230339896e-91 | 0.0 |
| Business & Finance | 7.015005998773669e-119 | 0.0 |
| Comedy | 3.0021366917472324e-114 | 0.0 |
| Education | 8.955847023207261e-109 | 0.0 |
| Fiction | 2.0842131917784576e-66 | 0.0 |
| Government | 3.660556526348004e-38 | 0.0 |
| Health & Fitness | 2.012532254502817e-116 | 0.0 |
| History | 1.4460126372126276e-63 | 0.0 |
| Kids & Family | 9.028239730270882e-95 | 0.0 |
| Leisure | 2.0127116571910826e-101 | 0.0 |
| Music | 1.0564596796484096e-86 | 0.0 |
| News & Politics | 3.693276080744102e-103 | 0.0 |
| Others | 3.139281959947661e-68 | 0.0 |
| Religion & Spirituality | 6.3531948911195225e-111 | 0.0 |
| Science | 1.0528319397022419e-73 | 0.0 |
| Society & Culture | 2.0872461061468374e-136 | 0.0 |
| Sports & Recreation | 1.614874020997192e-118 | 0.0 |
| TV & Film | 8.852778681315428e-113 | 0.0 |
| Technology | 1.6473226779339912e-72 | 0.0 |
| True Crime | 1.7749459108517478e-100 | 0.0 |
Statistical Hypotheses:
Null Hypothesis (H0): There are no significant differences in rating averages among categories.
Alternative Hypothesis (H1): There are significant differences in rating averages among categories.
Hypothesis Testing:
To test the hypothesis regarding the difference in average ratings between podcast categories, the Kruskal-Wallis test was conducted. This nonparametric test was chosen because the low p-values from the Shapiro-Wilk and Kolmogorov-Smirnov tests indicated violations of the normality assumption. The resulting test statistic of 7045.29 with a p-value close to 0.0 provides strong evidence against the null hypothesis, so we conclude that there are meaningful differences in average ratings across podcast categories.
category_groups = [group["rating"] for _, group in sampled_data.groupby("category")]
h_statistic, p_value = kruskal(*category_groups)
print(f"P-value is: {p_value}, and test statistic is: {h_statistic}")
alpha = 0.05
if p_value < alpha:
print(
"Kruskal-Wallis Test: There are significant differences in rating averages among categories."
)
else:
print(
"Kruskal-Wallis Test: No significant differences in rating averages among categories were found."
)
P-value is: 0.0, and test statistic is: 7045.288475815118
Kruskal-Wallis Test: There are significant differences in rating averages among categories.
dunn_results = posthoc_dunn(
sampled_data, val_col="rating", group_col="category", p_adjust="bonferroni"
)
print(dunn_results)
Arts Business & Finance Comedy \
Arts 1.000000e+00 6.547332e-29 1.000000e+00
Business & Finance 6.547332e-29 1.000000e+00 6.339745e-47
Comedy 1.000000e+00 6.339745e-47 1.000000e+00
Education 1.739472e-15 4.632711e-01 1.775443e-23
Fiction 7.624549e-06 2.889253e-43 1.401408e-07
Government 2.138811e-01 2.429155e-11 1.236089e-01
Health & Fitness 1.678479e-08 7.772101e-09 9.263626e-14
History 1.000000e+00 6.314510e-15 1.000000e+00
Kids & Family 1.000000e+00 1.313875e-30 1.000000e+00
Leisure 7.412572e-02 9.657623e-16 7.257435e-03
Music 1.000000e+00 3.651925e-12 7.458100e-01
News & Politics 1.802237e-153 0.000000e+00 9.913747e-259
Others 1.000000e+00 3.298722e-22 1.000000e+00
Religion & Spirituality 6.765378e-16 3.452566e-01 1.699017e-24
Science 4.432943e-06 4.643807e-52 2.456006e-08
Society & Culture 2.912177e-01 1.422405e-105 4.170844e-04
Sports & Recreation 1.031376e-01 2.785183e-89 2.301198e-04
TV & Film 2.467110e-21 5.671723e-175 1.708955e-38
Technology 1.726284e-06 1.046303e-51 9.140113e-09
True Crime 1.675757e-248 0.000000e+00 0.000000e+00
Education Fiction Government \
Arts 1.739472e-15 7.624549e-06 2.138811e-01
Business & Finance 4.632711e-01 2.889253e-43 2.429155e-11
Comedy 1.775443e-23 1.401408e-07 1.236089e-01
Education 1.000000e+00 4.416969e-31 1.926443e-08
Fiction 4.416969e-31 1.000000e+00 1.000000e+00
Government 1.926443e-08 1.000000e+00 1.000000e+00
Health & Fitness 4.605610e-01 8.669419e-24 3.855833e-06
History 5.527620e-09 8.417708e-02 1.000000e+00
Kids & Family 4.529617e-16 1.849270e-06 1.544410e-01
Leisure 1.144903e-05 1.477820e-14 6.627118e-04
Music 8.278227e-05 3.289272e-11 2.569645e-03
News & Politics 0.000000e+00 6.302417e-35 3.052015e-07
Others 4.993615e-14 4.735407e-01 1.000000e+00
Religion & Spirituality 1.000000e+00 1.711756e-31 1.842690e-08
Science 1.792636e-36 1.000000e+00 1.000000e+00
Society & Culture 1.203963e-56 6.091772e-03 1.000000e+00
Sports & Recreation 1.106336e-50 4.060352e-02 1.000000e+00
TV & Film 2.074659e-113 1.000000e+00 1.000000e+00
Technology 1.528424e-36 1.000000e+00 1.000000e+00
True Crime 0.000000e+00 8.344859e-69 1.960076e-15
Health & Fitness History Kids & Family \
Arts 1.678479e-08 1.000000e+00 1.000000e+00
Business & Finance 7.772101e-09 6.314510e-15 1.313875e-30
Comedy 9.263626e-14 1.000000e+00 1.000000e+00
Education 4.605610e-01 5.527620e-09 4.529617e-16
Fiction 8.669419e-24 8.417708e-02 1.849270e-06
Government 3.855833e-06 1.000000e+00 1.544410e-01
Health & Fitness 1.000000e+00 4.559223e-05 9.560801e-09
History 4.559223e-05 1.000000e+00 1.000000e+00
Kids & Family 9.560801e-09 1.000000e+00 1.000000e+00
Leisure 6.748705e-01 1.415928e-01 9.456220e-02
Music 6.673752e-01 8.575344e-01 1.000000e+00
News & Politics 0.000000e+00 4.131891e-55 4.538777e-171
Others 7.292406e-09 1.000000e+00 1.000000e+00
Religion & Spirituality 3.908836e-01 4.483518e-09 1.546357e-16
Science 5.776229e-28 1.513078e-01 8.093922e-07
Society & Culture 7.646485e-44 1.000000e+00 5.932844e-02
Sports & Recreation 2.558988e-38 1.000000e+00 2.076450e-02
TV & Film 2.970267e-101 2.436026e-05 1.233376e-24
Technology 3.717438e-28 8.244841e-02 3.088103e-07
True Crime 0.000000e+00 8.203795e-92 4.355996e-276
Leisure Music News & Politics \
Arts 7.412572e-02 1.000000e+00 1.802237e-153
Business & Finance 9.657623e-16 3.651925e-12 0.000000e+00
Comedy 7.257435e-03 7.458100e-01 9.913747e-259
Education 1.144903e-05 8.278227e-05 0.000000e+00
Fiction 1.477820e-14 3.289272e-11 6.302417e-35
Government 6.627118e-04 2.569645e-03 3.052015e-07
Health & Fitness 6.748705e-01 6.673752e-01 0.000000e+00
History 1.415928e-01 8.575344e-01 4.131891e-55
Kids & Family 9.456220e-02 1.000000e+00 4.538777e-171
Leisure 1.000000e+00 1.000000e+00 4.636785e-251
Music 1.000000e+00 1.000000e+00 1.079902e-165
News & Politics 4.636785e-251 1.079902e-165 1.000000e+00
Others 9.540704e-04 1.988885e-02 3.944493e-60
Religion & Spirituality 7.366326e-06 6.419010e-05 0.000000e+00
Science 2.319729e-16 4.524920e-12 8.049610e-49
Society & Culture 2.669238e-15 1.039558e-07 3.832457e-279
Sports & Recreation 7.920740e-15 4.659819e-08 5.848267e-214
TV & Film 4.408084e-53 1.073869e-32 4.873295e-106
Technology 9.371786e-17 1.694069e-12 7.699651e-45
True Crime 0.000000e+00 1.094172e-255 9.328188e-18
Others Religion & Spirituality Science \
Arts 1.000000e+00 6.765378e-16 4.432943e-06
Business & Finance 3.298722e-22 3.452566e-01 4.643807e-52
Comedy 1.000000e+00 1.699017e-24 2.456006e-08
Education 4.993615e-14 1.000000e+00 1.792636e-36
Fiction 4.735407e-01 1.711756e-31 1.000000e+00
Government 1.000000e+00 1.842690e-08 1.000000e+00
Health & Fitness 7.292406e-09 3.908836e-01 5.776229e-28
History 1.000000e+00 4.483518e-09 1.513078e-01
Kids & Family 1.000000e+00 1.546357e-16 8.093922e-07
Leisure 9.540704e-04 7.366326e-06 2.319729e-16
Music 1.988885e-02 6.419010e-05 4.524920e-12
News & Politics 3.944493e-60 0.000000e+00 8.049610e-49
Others 1.000000e+00 3.306089e-14 8.731518e-01
Religion & Spirituality 3.306089e-14 1.000000e+00 4.201078e-37
Science 8.731518e-01 4.201078e-37 1.000000e+00
Society & Culture 1.000000e+00 5.741894e-60 5.402766e-03
Sports & Recreation 1.000000e+00 4.183075e-53 5.099429e-02
TV & Film 1.789166e-04 1.175773e-118 1.000000e+00
Technology 4.908211e-01 3.807213e-37 1.000000e+00
True Crime 1.214218e-102 0.000000e+00 6.849347e-94
Society & Culture Sports & Recreation \
Arts 2.912177e-01 1.031376e-01
Business & Finance 1.422405e-105 2.785183e-89
Comedy 4.170844e-04 2.301198e-04
Education 1.203963e-56 1.106336e-50
Fiction 6.091772e-03 4.060352e-02
Government 1.000000e+00 1.000000e+00
Health & Fitness 7.646485e-44 2.558988e-38
History 1.000000e+00 1.000000e+00
Kids & Family 5.932844e-02 2.076450e-02
Leisure 2.669238e-15 7.920740e-15
Music 1.039558e-07 4.659819e-08
News & Politics 3.832457e-279 5.848267e-214
Others 1.000000e+00 1.000000e+00
Religion & Spirituality 5.741894e-60 4.183075e-53
Science 5.402766e-03 5.099429e-02
Society & Culture 1.000000e+00 1.000000e+00
Sports & Recreation 1.000000e+00 1.000000e+00
TV & Film 5.367599e-26 1.947548e-17
Technology 2.159556e-03 2.184989e-02
True Crime 0.000000e+00 0.000000e+00
TV & Film Technology True Crime
Arts 2.467110e-21 1.726284e-06 1.675757e-248
Business & Finance 5.671723e-175 1.046303e-51 0.000000e+00
Comedy 1.708955e-38 9.140113e-09 0.000000e+00
Education 2.074659e-113 1.528424e-36 0.000000e+00
Fiction 1.000000e+00 1.000000e+00 8.344859e-69
Government 1.000000e+00 1.000000e+00 1.960076e-15
Health & Fitness 2.970267e-101 3.717438e-28 0.000000e+00
History 2.436026e-05 8.244841e-02 8.203795e-92
Kids & Family 1.233376e-24 3.088103e-07 4.355996e-276
Leisure 4.408084e-53 9.371786e-17 0.000000e+00
Music 1.073869e-32 1.694069e-12 1.094172e-255
News & Politics 4.873295e-106 7.699651e-45 9.328188e-18
Others 1.789166e-04 4.908211e-01 1.214218e-102
Religion & Spirituality 1.175773e-118 3.807213e-37 0.000000e+00
Science 1.000000e+00 1.000000e+00 6.849347e-94
Society & Culture 5.367599e-26 2.159556e-03 0.000000e+00
Sports & Recreation 1.947548e-17 2.184989e-02 0.000000e+00
TV & Film 1.000000e+00 1.000000e+00 1.353873e-217
Technology 1.000000e+00 1.000000e+00 3.030409e-87
True Crime 1.353873e-217 3.030409e-87 1.000000e+00
Categories Rating Count Differences
This analysis focuses on examining whether there are rating count differences between podcast categories in the database.
Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.
Significance Levels:
The chosen significance level for hypothesis testing is α = 0.05.
Confidence Intervals:
The confidence intervals for the count of ratings within podcast categories are as follows:
Arts: (0.999057, 1.000314)
Business & Finance: (0.999591, 1.000136)
Comedy: (0.999599, 1.000134)
Education: (0.999440, 1.000187)
Fiction: (0.997673, 1.000776)
Government: (0.989482, 1.003506)
Health & Fitness: (0.999592, 1.000136)
History: (0.997084, 1.000972)
Kids & Family: (0.999177, 1.000274)
Leisure: (0.999327, 1.000224)
Music: (0.998823, 1.000392)
News & Politics: (0.999577, 1.000141)
Others: (0.997670, 1.000777)
Religion & Spirituality: (0.999483, 1.000172)
Science: (0.998288, 1.000571)
Society & Culture: (0.999825, 1.000058)
Sports & Recreation: (0.999673, 1.000109)
TV & Film: (0.999635, 1.000122)
Technology: (0.998191, 1.000603)
True Crime: (0.999580, 1.000140)
Each interval is a 95% Wilson-score confidence interval. Strictly speaking, these are intervals for the proportion of reviews in each category that carry a rating, not for the raw count, which is why every interval sits close to 1. For example, for the Arts category we can be 95% confident that the true proportion lies within (0.999057, 1.000314); upper bounds slightly above 1 are an artifact of the extra margin-of-error widening applied in the code below. These intervals illustrate how little the availability of ratings varies across podcast categories.
category_counts = sampled_data.groupby("category")["rating"].count()
confidence_intervals = []
# confidence_level (e.g. 0.95) is defined earlier in the notebook.
for count, size in zip(category_counts, sampled_data.groupby("category").size()):
    if size < 2:
        # Too few observations for a meaningful interval.
        ci_low, ci_high = count, count
    else:
        # Wilson-score interval for the proportion of reviews with a rating.
        ci_low, ci_high = proportion_confint(
            count, size, alpha=1 - confidence_level, method="wilson"
        )
    if ci_low == ci_high:
        margin_of_error = 0
    else:
        margin_of_error = (ci_high - ci_low) / 2
    # Widening by the half-width doubles the interval, making the reported
    # bounds conservative (and allowing upper bounds slightly above 1).
    ci_low_adjusted = max(0, ci_low - margin_of_error)
    ci_high_adjusted = ci_high + margin_of_error
    confidence_intervals.append((ci_low_adjusted, ci_high_adjusted))
print("Confidence Intervals for Count of Ratings within Categories:")
for category, ci in zip(category_counts.index, confidence_intervals):
    print(f"{category}: {ci}")
Confidence Intervals for Count of Ratings within Categories: Arts: (0.9990565917157656, 1.0003144694280781) Business & Finance: (0.999591416506534, 1.000136194497822) Comedy: (0.9995991755858225, 1.0001336081380594) Education: (0.9994404469855083, 1.0001865176714972) Fiction: (0.9976726344045528, 1.0007757885318156) Government: (0.9894820150277689, 1.0035059949907437) Health & Fitness: (0.999591677085669, 1.0001361076381103) History: (0.9970836788522089, 1.0009721070492636) Kids & Family: (0.9991770467433561, 1.0002743177522146) Leisure: (0.9993270705455949, 1.000224309818135) Music: (0.998823044357235, 1.000392318547588) News & Politics: (0.9995770200916998, 1.0001409933027667) Others: (0.9976698108928547, 1.0007767297023817) Religion & Spirituality: (0.99948269411569, 1.0001724352947698) Science: (0.9982880393204676, 1.0005706535598442) Society & Culture: (0.9998245366611082, 1.0000584877796306) Sports & Recreation: (0.9996728602193614, 1.000109046593546) TV & Film: (0.9996350930223659, 1.0001216356592117) Technology: (0.9981913135648709, 1.0006028954783766) True Crime: (0.9995801023972818, 1.0001399658675725)
Data Sample Sizes & Independence of Observations Check:
contingency_table = pd.crosstab(sampled_data["category"], sampled_data["rating"])
independence_check_passed = check_independence(contingency_table)
appropriate_sample_sizes_passed = check_sample_sizes(contingency_table)
if independence_check_passed and appropriate_sample_sizes_passed:
    print("Assumptions for the chi-square test are met.")
else:
    print(
        "Assumptions for the chi-square test are not fully met. "
        "Further examination may be required."
    )
Assumptions for the chi-square test are met.
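The `check_independence` and `check_sample_sizes` helpers are defined earlier in the notebook. As a rough sketch of what a sample-size check can look like (an assumption, not the notebook's actual implementation), the usual rule of thumb is that every expected cell frequency should be at least 5:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def expected_frequencies_ok(table, min_expected=5):
    """Rule of thumb: all expected cell counts should be >= min_expected."""
    _, _, _, expected = chi2_contingency(table)
    return bool((expected >= min_expected).all())

# Large counts pass the check; very small tables typically do not.
big = pd.DataFrame([[50, 60], [70, 80]])
small = pd.DataFrame([[1, 2], [2, 1]])
print(expected_frequencies_ok(big), expected_frequencies_ok(small))  # True False
```

Independence, by contrast, is a property of how the data were collected (one review per observation) and cannot be verified from the contingency table alone.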
Statistical Hypotheses:
Null Hypothesis (H0): There is no difference in the average number of people voting for podcasts across the existing categories.
Alternative Hypothesis (H1): There is a difference in the average number of people voting for podcasts across the existing categories.
Hypothesis Testing:
To investigate the differences in the average number of people voting for podcasts within categories, the chi-square test was conducted. This choice was justified by the categorical nature of the data, aligning with the test's suitability for analyzing counts or proportions across categorical variables. Prior to conducting the test, we ensured that the assumptions of the chi-square test, including the independence of observations and appropriate sample sizes, were met. The resulting test statistic of 7812.36, coupled with a p-value close to 0.0, provided compelling evidence against the null hypothesis, indicating significant differences in the average number of people voting for podcasts within categories. This rigorous approach to hypothesis testing enables us to conclude that meaningful disparities exist in the voting patterns across various podcast genres.
contingency_table = pd.crosstab(sampled_data["category"], sampled_data["rating"])
chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)
alpha = 0.05
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_val)
print("Degrees of Freedom:", dof)
if p_val < alpha:
    print(
        "Reject the null hypothesis. There is a significant difference "
        "between ratings within podcast categories."
    )
else:
    print(
        "Fail to reject the null hypothesis. There is no significant "
        "difference between ratings within podcast categories."
    )
print("Expected Frequencies Table:")
print(expected[:5])
Chi-square Statistic: 7812.363250371442 P-value: 0.0 Degrees of Freedom: 76 Reject the null hypothesis. There is a significant difference between ratings within podcast categories. Expected Frequencies Table: [[ 332.62953673 130.71659946 148.21530992 179.51582018 5312.92273372] [ 768.30665765 301.92879026 342.34725664 414.64507679 12271.77221866] [ 783.18343739 307.77506019 348.97615238 422.67388068 12509.39146937] [ 560.95813418 220.44506468 249.95550464 302.74178456 8959.89951194] [ 134.70842313 52.93765299 60.02428672 72.70037803 2151.62925913]]
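With a chi-square statistic this large, it is often useful to see which cells drive it. The sketch below computes Pearson residuals on a small toy table (a stand-in for the real `contingency_table`, so the snippet runs standalone); cells with an absolute residual above roughly 2 contribute most to the statistic:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for the category-by-rating crosstab built above.
observed = pd.DataFrame(
    [[30, 10, 200], [20, 15, 150], [5, 40, 100]],
    index=["Cat A", "Cat B", "Cat C"],
    columns=[1, 3, 5],
)
chi2, p, dof, expected = chi2_contingency(observed)
# Pearson residuals: (observed - expected) / sqrt(expected).
# Their squares sum to the chi-square statistic.
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(2))
```

On the real data, sorting these residuals would show which category/rating combinations are over- or under-represented relative to independence.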
Temporal Analysis
This analysis focuses on examining whether there are differences in mean ratings across months in the database.
Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.
Significance Levels:
The chosen significance level for hypothesis testing is α = 0.05.
Confidence Intervals:
The confidence intervals for the mean rating by each month are as follows:
Month 1: (4.615, 4.634)
Month 2: (4.639, 4.658)
Month 3: (4.632, 4.651)
Month 4: (4.657, 4.676)
Month 5: (4.653, 4.672)
Month 6: (4.619, 4.638)
Month 7: (4.629, 4.648)
Month 8: (4.622, 4.641)
Month 9: (4.637, 4.655)
Month 10: (4.628, 4.646)
Month 11: (4.627, 4.646)
Month 12: (4.606, 4.624)
Each interval provides a range of plausible values for the true population parameter, the mean rating score for all podcasts in the dataset for the corresponding month, with a specified level of confidence of 95%. The lower bound of each confidence interval represents the lower estimate of the mean rating score, while the upper bound represents the upper estimate. We can be 95% confident that the true mean rating score for each month falls within its respective interval.
sampled_data["created_at"] = pd.to_datetime(sampled_data["created_at"])
sampled_data["month"] = sampled_data["created_at"].dt.month
monthly_rating = sampled_data.groupby("month")["rating"].mean()
mean_rating = monthly_rating.mean()
std_rating = monthly_rating.std()
# Standard error of the twelve monthly means; note this gives every month
# the same interval half-width.
std_error = std_rating / len(monthly_rating) ** 0.5
# T-score for 95% confidence level
t_score = t.ppf(0.975, df=len(monthly_rating) - 1)
confidence_intervals = []
for month, rating in monthly_rating.items():
    lower_bound = rating - t_score * std_error
    upper_bound = rating + t_score * std_error
    confidence_intervals.append((month, lower_bound, upper_bound))
print("Confidence intervals for mean rating by each month:")
for month, lower, upper in confidence_intervals:
    print(f"Month {month}: ({lower}, {upper})")
Confidence intervals for mean rating by each month: Month 1: (4.614953467004247, 4.633840158851438) Month 2: (4.639409311081235, 4.658296002928426) Month 3: (4.631804807695311, 4.650691499542503) Month 4: (4.657015545804884, 4.675902237652076) Month 5: (4.65336809670944, 4.672254788556631) Month 6: (4.61902761321306, 4.637914305060251) Month 7: (4.629482423539289, 4.648369115386481) Month 8: (4.622132565474651, 4.641019257321842) Month 9: (4.636555840548599, 4.65544253239579) Month 10: (4.627515799051409, 4.6464024908986) Month 11: (4.627207014529546, 4.646093706376737) Month 12: (4.6055529016031835, 4.624439593450375)
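One caveat: the loop above applies the standard error of the twelve monthly means to every month, so all intervals share the same half-width. An alternative sketch (on synthetic data, not the notebook's `sampled_data`) derives each month's interval from that month's own standard deviation and sample size:

```python
import numpy as np
import pandas as pd
from scipy.stats import t

# Synthetic ratings standing in for sampled_data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": rng.integers(1, 13, size=1200),
    "rating": rng.choice([3, 4, 5], size=1200, p=[0.05, 0.20, 0.75]),
})
for month, grp in df.groupby("month"):
    n = len(grp)
    se = grp["rating"].std(ddof=1) / np.sqrt(n)   # month-specific standard error
    half = t.ppf(0.975, df=n - 1) * se            # t-based 95% half-width
    m = grp["rating"].mean()
    print(f"Month {month}: ({m - half:.3f}, {m + half:.3f})")
```

With this approach months with fewer or noisier reviews get appropriately wider intervals.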
Data Normality and Homogeneity of Variances Tests:
month_groups = [group["rating"] for _, group in sampled_data.groupby("month")]
shapiro_p_values = [shapiro(rating)[1] for rating in month_groups]
print("Shapiro-Wilk Test p-values:", shapiro_p_values)
levene_stat, levene_p_value = levene(*month_groups)
print("Levene's Test p-value:", levene_p_value)
/tmp/ipykernel_7695/3141775468.py:3: UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. (Raised once per monthly group; N values: 17822, 16560, 16248, 16043, 16255, 16494, 16570, 17336, 17209, 17403, 15536, 14657.)
Shapiro-Wilk Test p-values: [1.8747089767737397e-118, 7.961967357139668e-117, 2.954699448513843e-116, 1.9349365936573582e-116, 1.135387911507323e-116, 3.224140871100971e-116, 1.3108350380063052e-116, 1.4692818787365716e-117, 1.0922381339800924e-117, 6.496702124256376e-118, 1.3327205331084589e-114, 1.9856925052677519e-112] Levene's Test p-value: 0.0001316362139861504
Statistical Hypotheses:
Null Hypothesis (H0): No specific time periods are associated with higher or lower ratings.
Alternative Hypothesis (H1): Specific time periods are associated with higher or lower ratings.
Hypothesis Testing:
To examine the relationship between review time periods and ratings, the Kruskal-Wallis test was employed. This choice was made due to concerns regarding the normality assumption, as indicated by the Shapiro-Wilk results, and the unequal variances flagged by Levene's test. The Kruskal-Wallis test is a non-parametric alternative to ANOVA and is well suited for analyzing the effect of a single categorical factor (time periods) on a continuous outcome variable (ratings) when the normality assumption is violated. The resulting test statistic of 40.14 and p-value of 3.39e-05 provide compelling evidence against the null hypothesis, indicating that specific time periods are indeed associated with higher or lower ratings. This underscores the role of time periods in shaping ratings and improves our understanding of temporal dynamics in podcast reviews.
month_groups = [group["rating"] for _, group in sampled_data.groupby("month")]
h_statistic, p_value = kruskal(*month_groups)
print(f"P-value is: {p_value}, and test statistic is: {h_statistic}")
if p_value < 0.05:
    print("There are significant differences in ratings among different months.")
else:
    print("No significant differences in ratings among different months were found.")
P-value is: 3.385299573556052e-05, and test statistic is: 40.14017461949268 There are significant differences in ratings among different months.
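As a hedged aside: with the monthly group sizes printed above summing to roughly 198,000 reviews, even tiny differences reach statistical significance, so an effect-size measure helps put H = 40.14 in perspective. One common choice (assumed here; it is not part of the original analysis) is epsilon-squared:

```python
# Epsilon-squared effect size for a Kruskal-Wallis H statistic:
# eps^2 = H / ((n^2 - 1) / (n + 1)), which simplifies to H / (n - 1).
def epsilon_squared(h_statistic, n_total):
    return h_statistic / ((n_total**2 - 1) / (n_total + 1))

# With H ~= 40.14 and roughly 200k sampled reviews (approximate figure),
# the effect is statistically significant but practically tiny.
print(f"{epsilon_squared(40.14, 200_000):.6f}")
```

An epsilon-squared this small suggests the monthly effect, while real, explains a negligible share of the variation in ratings.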
dunn_results = posthoc_dunn(
    sampled_data, val_col="rating", group_col="month", p_adjust="bonferroni"
)
print(dunn_results)
1 2 3 4 5 6 7 \
1 1.000000 1.000000 1.000000 0.042600 0.336116 1.000000 1.000000
2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
4 0.042600 1.000000 1.000000 1.000000 1.000000 0.070442 0.508347
5 0.336116 1.000000 1.000000 1.000000 1.000000 0.497301 1.000000
6 1.000000 1.000000 1.000000 0.070442 0.497301 1.000000 1.000000
7 1.000000 1.000000 1.000000 0.508347 1.000000 1.000000 1.000000
8 1.000000 1.000000 1.000000 0.018038 0.159561 1.000000 1.000000
9 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
10 1.000000 1.000000 1.000000 0.462842 1.000000 1.000000 1.000000
11 1.000000 1.000000 1.000000 0.069872 0.481064 1.000000 1.000000
12 1.000000 0.106053 0.130529 0.000034 0.000579 1.000000 0.922705
8 9 10 11 12
1 1.000000 1.000000 1.000000 1.000000 1.000000
2 1.000000 1.000000 1.000000 1.000000 0.106053
3 1.000000 1.000000 1.000000 1.000000 0.130529
4 0.018038 1.000000 0.462842 0.069872 0.000034
5 0.159561 1.000000 1.000000 0.481064 0.000579
6 1.000000 1.000000 1.000000 1.000000 1.000000
7 1.000000 1.000000 1.000000 1.000000 0.922705
8 1.000000 1.000000 1.000000 1.000000 1.000000
9 1.000000 1.000000 1.000000 1.000000 0.410355
10 1.000000 1.000000 1.000000 1.000000 0.851710
11 1.000000 1.000000 1.000000 1.000000 1.000000
12 1.000000 0.410355 0.851710 1.000000 1.000000
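The full Dunn matrix is easier to scan when reduced to its significant pairs. A small helper sketch (shown on a toy symmetric matrix so it runs standalone; the real `dunn_results` DataFrame would be passed in the same way):

```python
import pandas as pd

def significant_pairs(pmatrix, alpha=0.05):
    """Return (group_a, group_b, p) for each upper-triangle pair with p < alpha."""
    cols = list(pmatrix.columns)
    return [
        (a, b, pmatrix.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if pmatrix.loc[a, b] < alpha
    ]

# Toy symmetric p-value matrix for months 1, 4 and 12.
pvals = pd.DataFrame(
    [[1.0, 0.0426, 1.0],
     [0.0426, 1.0, 0.000034],
     [1.0, 0.000034, 1.0]],
    index=[1, 4, 12], columns=[1, 4, 12],
)
print(significant_pairs(pvals))  # [(1, 4, 0.0426), (4, 12, 3.4e-05)]
```

Applied to the matrix above, this surfaces the April vs. January/August/December and May vs. December contrasts without reading the whole grid.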
cnx.close()
OUTCOMES:
- To test the hypothesis regarding the difference in average ratings between podcast categories, the Kruskal-Wallis test was conducted. The test statistic obtained was 7045.29, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.
- To test the hypothesis regarding the difference in rating counts between podcast categories, the chi-square test was conducted. The test statistic obtained was 7812.36, and the corresponding p-value was effectively 0.0. Comparing the p-value to the significance level of α = 0.05, we reject the null hypothesis.
- To test the hypothesis regarding the difference in average ratings between different months, the Kruskal-Wallis test (a non-parametric analogue of ANOVA) was conducted. The test statistic obtained was 40.14, and the corresponding p-value was 3.39e-05. Comparing the p-value to the significance level of α = 0.05, we reject the null hypothesis.
INSIGHTS:
- The results of the Dunn's and Kruskal-Wallis tests indicate that there is sufficient evidence to conclude that average ratings differ between podcast categories. Specifically, the Arts, History, Kids & Family, Leisure, Music, Others, and Religion & Spirituality categories received significantly higher ratings, while the Business & Finance, Comedy, Government, Health & Fitness, News & Politics, Society & Culture, Sports & Recreation, TV & Film, Technology, and True Crime categories received significantly lower ratings than other categories.
- The results of the chi-square test indicate that there is sufficient evidence to conclude that rating counts differ between podcast categories.
- Based on the Dunn's test results for months, April stands out with significantly higher ratings than January, August, and December, and May rates significantly higher than December; December shows the lowest monthly mean. No other pairwise differences reach significance after the Bonferroni correction.
¶
Insights and Findings:
Difference in Podcast Categories Ratings:
- The analysis reveals significant differences in average ratings across podcast categories. Categories such as Arts, History, Kids & Family, Leisure, Music, Others, and Religion & Spirituality tend to receive higher ratings, whereas Business & Finance, Comedy, Government, Health & Fitness, News & Politics, Society & Culture, Sports & Recreation, TV & Film, Technology, and True Crime receive lower ratings on average. This highlights the importance of tailoring content and marketing strategies to better suit audience preferences within each category.
Variation in Rating Count Across Podcast Categories:
- Notable differences in ratings counts between different podcast categories suggest potential disparities in audience engagement and reach. Understanding these variations can inform decisions regarding content creation, promotion, and audience targeting strategies.
Seasonal Trends in Podcast Ratings:
- The monthly analysis shows variations in average ratings across months. April exhibits the highest average ratings, while December exhibits the lowest. These seasonal trends can provide valuable insights for content scheduling and promotion strategies, helping to maximize audience engagement and satisfaction throughout the year.
¶
Recommendations for Action:
Enhancing Visibility and Engagement: Given the observed differences between podcast ratings and voting counts, strategies should be developed to enhance visibility and engagement for high-quality podcasts with potentially underrepresented reach.
Audience-Centric Content Improvement: Exploring audience feedback and preferences can identify areas for content improvement. Analyzing listener reviews, ratings, and engagement metrics can help understand what resonates most with the audience, enabling adjustments to content strategies for better alignment with audience preferences.
¶
Further Areas for Investigation:
Factors Influencing Voting Counts: Additional investigation is needed to explore factors influencing differences in voting counts among podcasts with similar average ratings. This could involve analyzing promotion strategies, audience demographics, or platform visibility to optimize audience engagement.
Understanding Variation in Ratings Across Categories: Further examination of variables contributing to variations in average ratings across podcast categories, such as content format, host expertise, or audience demographics, can provide deeper insights into audience preferences and content performance.